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Abstract 

Background: Loss of CpG dinucleotides in genomic DNA through methylation-induced mutation is characteristic 
of vertebrates and plants. However, these and other eukaryotic phyla show a range of other dinucleotide frequency 
biases with currently uncharacterized underlying mutational or selection mechanisms. We developed a 
parameterized Markov process to identify what neighbour context-dependent mutations best accounted for 
patterns of dinucleotide frequency biases in genomic and cytoplasmically expressed mRNA sequences of different 
vertebrates, other eukaryotic groups and RNA viruses that infect them. 

Results: Consistently, 1 1- to 14-fold greater frequencies of the methylation-associated mutation of C to T upstream 
of G (depicted as C^T,G) than other transitions best modelled dinucleotide frequencies in mammalian genomic 
DNA. However, further mutations such as G^T,T (5-fold greater than the default transversion rate) were required to 
account for the full spectrum of dinucleotide frequencies in mammalian sequence datasets. Consistent with 
modeling predictions for these two mutations, instability of both CpG and CpT dinucleotides was identified 
through SNP frequency analysis of human DNA sequences. Different sets of context-dependent mutations were 
modelled in other eukaryotes with non-methylated genomic DNA. In contrast to genomic DNA, best-fit models of 
dinucleotide frequencies in transcribed RNA sequences expressed in the cytoplasm from all organisms were 
dominated by mutations that eliminated UpA dinucleotides, observations consistent with cytoplasmically driven 
selection for mRNA stability. Surprisingly, mRNA sequences from organisms with methylated genomes showed 
evidence for additional selection against CpG through further context-dependent mutations (eg. C— >A,G). Similar 
mutation or selection processes were identified among single-stranded mammalian RNA viruses; these potentially 
account for their previously described but unexplained under-representations of CpG and UpA dinucleotides. 

Conclusions: Methods we have developed identify mutational processes and selection pressures in organisms that 
provide new insights into nucleotide compositional constraints and a wealth of biochemical and evolutionarily 
testable predictions for the future. 
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Background 

One of the most striking compositional abnormalities in 
DNA sequences of mammalian and other vertebrate gen- 
omic DNA sequences is the marked under-representation 
of CpG and over-representation of CpA and TpG dinucle- 
otides. This compositional abnormality was first recog- 
nized over 50 years ago [1-3] and is now generally 
accepted to result directly from the mutagenic effect of 
methylation of cytosine (mC) bases in CpG dinucleotides. 
mC is more likely to deaminate to thymine [4,5] so deplet- 
ing CpG dinucleotides and increasing the frequencies of 
TpG and CpA on the opposite strand through mismatch 
repair. 

In more general terms, dinucleotide compositional ab- 
normalities reflect either context-sensitive differences in 
mutation rates (as in the case of DNA methylation and 
CpG under-representation) or specific selection for or 
against certain dinucleotides. As an example of the latter, 
the UpA dinucleotide is targeted by RNA-degrading 
enzymes and its presence in an RNA sequence acceler- 
ates its degradation in the cytoplasm. UpA composition 
therefore modulates protein expression from mRNA 
through its influence on transcriptome turnover [6,7]. 
The widespread suppression of UpA dinucleotides in 
mRNA sequences may therefore reflect selection for in- 
creased stability in the cytoplasm. 

In spite of these two well known examples, it remains 
unclear whether the combination of mutational biases 
against CpG in genomic DNA and selection against 
UpA in mRNA accounts for the complex pattern of 
over- and under-representation of each of the 16 dinu- 
cleotides in vertebrates. Secondly it remains unexplained 
why the degree of CpG under-representation is inversely 
proportional to the G+C content of the underlying 
sequence, although it has been speculated that there are 
differences in the accessibility of genomic DNA in high 
and low G+C regions to deamination [8,9]. Thirdly, the 
observation made many years ago that many RNA 
viruses under-represent CpG dinucleotides despite the 
absence of a specific (methylation-dependent) muta- 
tional pathway for RNA has remained unexplained 
[10-12]. Patterns of CpG and UpA under-representation 
among viruses infecting hosts with different degrees of 
host genomic DNA methylation have remained similarly 
unexplored. Finally, eukaryotes with non-methylated 
genomes show different patterns of dinucleotide repre- 
sentation (such as elevated frequencies of ApA in 
ecdysozoa) for which neither a mutational nor a selec- 
tionist mechanism has yet been proposed. 

Using data from a range of eukaryotes with different 
methylation patterns, Simmen showed that the degree of 
over-representation of CpA and TpG dinucleotides were 
in proportion to the expected frequency created by 
C— >T transitions in methylated DNA [13]. In Duret and 



Galtier [14], an explicit mathematical model was devel- 
oped to investigate whether frequent CpG -context de- 
pendent mutations could account for the suppression in 
frequencies of TpA in human DNA sequences. Assign- 
ment of an elevated C— »T transition rate reproduced the 
CpG deficit (and G+C dependence) observed in mam- 
malian DNA and indirectly depleted TpA dinucleotide 
frequencies. However, this model failed to account for 
the full extent of TpA depletion observed in human 
DNA sequences and the model was not applied to inves- 
tigate the effect of this single mutational bias on other 
dinucleotide frequencies, such as TpG and CpA that also 
show compositional biases. How well this model might 
fully recreate the dinucleotide profile of human DNA 
remains unresolved. 

In the current study we have developed an extended 
model of sequence evolution that allows separate muta- 
tion rates for each type of transition and transversion in 
each dinucleotide context against a background, separ- 
ately optimized mean transition / transversion ratio (/<:). 
This model generalizes Duret and Galtier s model [14], 
in which k was fixed at 2.1 and only one context 
dependent mutation, (C— >T,G) was allowed to take a 
higher mutation rate. (This rate was based on observa- 
tional data available at the time of the study on sequence 
variability in human DNA sequences.) Our approach in 
contrast allowed up to 48 (or 96 for RNA) different di- 
nucleotide context dependent mutations and optimised 
rates to maximise the fit between model predictions and 
observed frequencies of all 16 dinucleotides. In the spe- 
cific case of analysing human DNA, the mutation C— »T, 
G and a transition rate of around 12 were discovered by 
the modelling rather than being imposed a priori. This 
analysis was also extended to the corresponding mRNA 
sequences to investigate whether additional or different 
mutational or selection pressures were exerted in cyto- 
plasmically expressed sequences. 

Modelling was extended to other mammalian DNA 
and mRNA datasets, organisms showing largely absent 
genomic DNA methylation (fish, insects, nematodes) 
and mammalian RNA viruses in which the phenomenon 
of CpG under-representation has been previously de- 
scribed [10,11]. Modelling was naturally restricted to 
processes showing global effects on DNA composition 
and was unsuited for modelling effects of genome modi- 
fications with specific functional roles. The latter include 
the recently discovered role of DNA methylation in the 
gene expression and development pathways of the honey 
bee (Apis mellifera) and other insects [15,16] that possess 
primarily non-methylated genomes. Modelling was also 
restricted to mutational processes or selection operating 
in dinucleotide contexts. While methylation (and associ- 
ated mutations) primarily occurs in a CpG context in ver- 
tebrates and where studied in other eukaryotic groups, 
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plant genomes are additionally heavily methylated (50- 
80%) in the CpA/T/CpG trinucleotide context [17]. As 
this potentially exerts a significant additional mutational 
pressure on plant genomic DNA and cannot be modelled, 
the analysis of plant genome sequences was excluded from 
the current study. 

On the larger genomic scale, we obtained evidence 
both for mutational processes acting on genomic DNA 
beyond simple methylation-induced hypermutation and 
for a range of additional likely selection pressures on 
mRNA that centre around the elimination of UpA dinu- 
cleotides. The existence of this selection pressure and its 
occurrence in RNA viruses provides evidence for a series of 
novel compositional constraints in the cytoplasm on viral 
RNA. Specific dinucleotides may be selected against to es- 
cape currently uncharacterized self/non-self recognition 
mechanisms that are coupled to the interferon system 
(in mammals) and potential parallel defence mechanisms 
in other eukaryotic phyla. 

Results 

Patterns of dinucleotide frequencies in genomic DNA 
and mRNA 

Ratios of observed to expected frequencies of dinucleo- 
tides were computed for DNA genomic sequences of 
several different eukaryotes and their corresponding 
mRNA sequences. DNA datasets were restricted to se- 
quences that were non-transcribed since mRNA sequences 
encoded by genomic DNA that enter the cytoplasm may be 
subject to additional selection pressures. These represent a 
relatively small component of mammalian DNA sequences 
(H. sapiens, P. troglodytes and M. musculus in the current 
study; 1.2-2.0%) but the proportion of cytoplasmically 
expressed sequences was much larger in other vertebrates 
and other eukaryotic phyla (6.5% - 28%). 

As anticipated, frequencies of CpG dinucleotides in 
non-transcribed genomic DNA sequences from eukaryotic 
genomes showing extensive methylation (H. sapiens, and 
D. rerio [zebra fish]; Figures 1A, IB) were substantially 
lower than expected from their G+C content. No such 
reduction was evident in DNA sequences of the mosquito, 
A. gambiae (Figure 1C) whose genome is largely unmeth- 
ylated. Results from other mammals (P. troglodytes and 
M. musculus) were in practical terms identical to human 
DNA sequences) while other organisms without methyla- 
tion of genomic DNA sequences {Caenorhabditis elegans 
[a nematode], Drosophila melanogaster [fruit fly]) showed 
no under-representation of CpG dinucleotides (data not 
shown). 

G+C contents of the subset of mRNA sequences were 
higher than non-cytoplasmically expressed sequences 
(Figure ID, IE, IF). Several further differences between 
DNA and mRNA sequences were apparent in their 
dinucleotide compositions and their relationship with 



G+C composition. For example, for sequences with a given 
G+C content, UpA under-representation was greater in 
mRNA sequences than genomic sequences of humans 
(Figure 1A, ID; p < 10" 10 by Student t-test (Additional 
file 1: Table SI)). Even more evidently, UpA frequencies 
followed a quite different relationship with G+C content 
in A. gambiae and CpG frequencies in mRNA were sub- 
stantially higher than in genomic DNA (Figure 1C, IF). 
These observations are consistent with the existence 
of additional selection pressures on the subset of se- 
quences expressed as mRNAs. 

Compositional biases extended to other dinucleo- 
tides in humans (Figure 2A, 2B) and other organisms 
(Additional file 2: Figure SI). Several instances of compos- 
itional asymmetries are evident in complementary dinu- 
cleotides in mRNA sequences, such as the higher 
frequencies of UpC dinucleotides compared to GpA and 
in the UpG/CpA and GpG/CpC pairs (Additional file 2: 
Figure SIB). 

Fitting the mutational model to observed dinucleotide 
frequencies 

Observations of differing dinucleotide representations in 
DNA and mRNA sequences and the asymmetries be- 
tween complementary pairs in mRNA justified the devel- 
opment of separate mutational models for DNA and 
mRNA sequences. To investigate which context depen- 
dent substitutions could account best for the pattern of 
dinucleotide under- and over-representation in each 
sequence dataset, we developed a Markov process pa- 
rameter estimation method. This evaluated every possible 
substitution with each upstream and downstream neigh- 
bouring base and an associated mutation rate that maxi- 
mized the fit between modelled and observed dinucleotide 
frequencies. The degree of fit was quantified by calculation 
of root mean square (RMS) distances between modelled 
frequencies for sequences of different G+C contents and 
those of a sample of actual sequences. 

For RNA, there were 96 possible context-dependent 
mutations, while the symmetry of DNA allowed 48 
(eg. C— >T,G is formally equivalent to C,G— >A; see 
Methods). For both datasets separate optimization of 
transition / transversion ratios (k) represented an add- 
itional parameter applied to mutations that was incor- 
porated into the modelling process. The parameters 
that produced the lowest RMS distance for all 16 di- 
nucleotides was selected. Since the model was fitted 
to sequences with a range of G+C contents, RMS dis- 
tances are additionally influenced by how well the 
model reproduces the marked G+C dependence in 
the frequencies of certain dinucleotides (such as CpG 
and UpA). 

Having identified the context-dependent mutation, its 
rate and k that best fitted the observational data, the 
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DNA sequences 

A) H. sapiens B) D. rerio C) A. gambiae 
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Figure 1 G+C composition (x-axis) and frequencies of CpG and TpA (or UpA) dinucleotides in representative organisms with 
methylated (H. sapiens, D. rerio) and non-methylated genomes (A. gambiae), labelled in panels. Symbols for CpG and UpA dinucleotide 
frequencies (blue and red dots respectively; see inset box) and were expressed as the ratio of observed frequency / frequency expected from 

mononucleotide (base) composition of the fragment. 

v. ) 



mutation (although not its rate) was fixed and the 
analysis repeated to find a second context-dependent 
mutation that in combination with the first and a re- 
optimised value of k, created the greatest further reduc- 
tion in RMS distances. This procedure was repeated 
with further context-dependent mutations until there 
was no further reduction in RMS distance. Fitting two 
parameters to human DNA and mRNA sequences led to 
a better match between the model predictions for UpA 
and CpG dinucleotide frequencies (red and blue lines re- 
spectively) across a range of G+C compositions than 
achieved using one parameter. Using four parameters 
provided a better match than two and indeed modelled 
CpG and UpA frequencies very closely matched the 
quadratic line of best fit through the observational data 
(black lines; Figure 2). 



Matches between model predictions and observed fre- 
quency data extended to other dinucleotides in human 
DNA sequences and in most cases also successfully 
reproduced relationships between dinucleotide frequen- 
cies and G+C content (Additional file 2: Figure S1A). 
Similarly close fits between modelled and observational 
data for the 16 dinucleotides were observed for human 
mRNA sequences (Additional file 2: Figure SIB). Model 
predictions additionally reproduced the observed differ- 
ences in frequencies of self-complementary dinucleotides 
(such as CpA and UpG), represented as red/yellow/ white 
and dark blue/light blue/white filled symbols. 

Modelling was extended to DNA and RNA datasets 
for the other organisms (data from a fish and an insect 
are shown in Additional file 2: Figure SIC, SID, S1E and 
S1F). Similarly close fits between modelled and observed 
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A)Human DNA 



B) Human mRNA 




Figure 2 Observed / expected CpG and UpA frequencies in (A) human DNA and (B) mRNA sequences as a function of G+C content. 

Frequencies of each dinucleotide predicted from mutational models with 1, 2 and 4 parameters (1p, 2p and 4p, labelled according to the inset 
box) were superimposed on observed distributions of CpG and UpA dinucleotides (blue and red dots respectively; see inset box). Quadratic lines 
of best fit through observed distribution (black lines) were matched to model predictions over a G+C composition range from 20%-80%. 



frequencies were observed at each using up to 4 parame- 
ters. The main difference from human sequences was 
the much more restricted range of G+C contents of both 
DNA and mRNA sequences in each that made fitting 
the data to G+C compositional trends less relevant. 

Quantifying model error 

To quantify how well our model fitted the observational 
data, RMS distances between the observed dinucleotide 
frequencies and those predicted by the model were 
calculated for all 16 dinucleotides. These were then 
compared with the corresponding minimum RMS dis- 
tances between the observed dinucleotide frequencies 
and a separate quadratic model of each dinucleotide. 
These quadratic models yield the best possible fit to the 
data and lowest possible RMS value. Any other model 
will have higher RMS values and the amount by which 
its RMS values are above the quadratic models' RMS 
values shows the error in the model. We refer to this as 
the baseline corrected model error and this is used in 
the presentation of the RMS results. As an example, the 
best fit model data for mammalian DNA using 3 
mutation rates had a RMS distance for all 16 dinucleo- 
tides of 0.0378 while quadratic best fit data showed a 
RMS distance of 0.0275. The baseline-corrected model 
error was therefore 0.0103 (0.0378 - 0.0275). The calcu- 
lation of baseline corrected model errors therefore ex- 
cludes measurement errors associated with dinucleotide 



frequency measurements of often relative short nucleo- 
tide sequences. 

The effect of sequence length on RMS calculations can 
be visualised by comparison of the degree of scatter of 
dinucleotide frequencies of human mRNA sequences 
(mean length of approximately 2463 bases) with that of 
the much longer DNA sequences (50,000 bps; Figures 1A, 
ID, 2 A, 2B). To more formally demonstrate the relation- 
ship between RMS scores and sequence lengths, human 
DNA sequences of lengths ranging from 400,000 to 500 
bps were generated and model error estimated for each 
dataset using separate modelling to minimum values 
(Additional file 3: Figure S2 in Supplementary Data). An 
empirical relationship between sequence length and RMS 
distance can be represented as: 

RMS distance = (2.2/length 0 42 ) + 0.0095 

The intercept with the y-axis of 0.0095 therefore repre- 
sents model error for DNA fragments of infinite length 
(ie. not attributable to sampling error). For DNA frag- 
ments of 50,000 bps in length, sampling error can be 
estimated to contribute 0.0329 to RMS distances, while 
for human mRNA sequences, sampling error was three 
times higher at 0.0994. These values are close to RMS dis- 
tances calculated from lines of best fit to the data (0.0275 
and 0.0972 respectively). This close match for human 
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sequences was reproduced in corresponding datasets for 
other organisms (Additional file 4: Table S2). 

Effectiveness of context-dependent mutational 
rate modelling 

Model errors for the first four most influential mutations 
and the minimum value were calculated for human, fish 
and insect datasets (Figure 3). All values were baseline 
corrected by subtraction of RMS scores of quadratic 



lines of best fit through observational data (uncorrected 
RMS scores are shown in Additional file 5: Figure S3). 
For non-cytoplasmically expressed human DNA se- 
quences, corrected model errors fell from an initial value 
of 0.228 (no context-dependent mutations) to 0.0024 
(minimum value achieved with 9 parameters; Figure 3A). 
The two most influential context-dependent mutations 
were C— >T,G (model error reduction to 0.100) followed by 
G— >T,T (0.019) with minimal proportionate reductions 
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Figure 3 Model errors (y-axis) for mutational models with between 1 to 4 context-dependent mutational biases for vertebrate (human, 

D. rerio) and non-vertebrate (A gambiae) DNA and mRNA sequences. Minimum RMS distances for up to 16 additional mutational biases are 
shown as a dotted line. Model error reductions for alternative mutational biases are shown as unfilled circles. Mutations that remove CpG and T/ 
UpA are shown in pink and blue inset boxes respectively. All model error values have been baseline corrected by subtraction of RMS scores of 
quadratic lines of best fit through observational data. 
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using further mutations. These model error reductions 
correspond to the successively better fits between mod- 
elled and observed frequencies for human DNA and 
mRNA for UpA and CpG dinucleotides displayed in 
Figure 2. Both the mutations and their mutation rates 
were highly reproducible on replicate sampling of DNA 
sequences (Additional file 6: Table S3). Similarly, for or- 
ganisms such as the three mammalian species in which 
we suspect similar mutational biases and selection pres- 
sures may exist, the first three context-dependent muta- 
tions were, with one exception, identical while model 
error reductions, values of k and mutation rates were 
highly similar (Table 1). 

A further insight into the robustness of these predic- 
tions was obtained by plotting out baseline corrected 
model error for 2 nd , 3 rd and 4 th ranked alternative 
context-dependent mutations. For human DNA, C— »T,G 
and G— »T,T led to a substantially greater reduction than 
alternatives despite their frequent similarities in their 
effects on sequence composition. For example, 2 nd and 
3 rd alternatives to C— »T,G also eliminated CpG residues 
from sequences (shaded pink boxes; Figure 3A). The 
same findings were obtained on analysis of DNA se- 
quences of other mammalian genomes (M. musculus 
and P. troglodytes; data not shown). 

Context-dependent mutations and single nucleotide 
polymorphism (SNP) frequencies 

The consistent prediction in mammalian datasets of the 
G— »T,T mutation, ranked second, was unexpected and 
did not correspond to any characterized mutational bias 
in mammalian genomes. To investigate whether there 
was greater mutability of the GpT dinucleotide in hu- 
man DNA sequences, we compiled a large dataset of ap- 
proximately 45 million SNPs compiled from the NCBI 
dbSNP database and compiled frequencies of each pos- 
sible mutation (ie. A<->C, A<->G, G<->T) subdivided 

into groups according to the base downstream of the 
SNP (3' dinucleotide context). These frequencies were 
normalized by frequencies of each dinucleotide and of 
each mutation in the human SNP dataset to calculate 
the influence of dinucleotide context on mutation fre- 
quencies. As expected, SNP mutations showed a strong 
preference for the first ranked C— >T,G mutation (8.6x 



expected frequency) predicted from modeling and add- 
itionally the C— >A,G and C— >G,G alternative mutations 
(3.5 x and 4.0x respectively; Figure 4). Consistent with 
the second ranked mutation detected on modelling, con- 
sistently elevated mutational frequencies were observed in 
the GpT dinucleotide (1.6x - 1.8 x) although in this case 
there was a less clear bias among the three possible muta- 
tions for the G<->T transversion predicted in the model. 
The instability of GpT identified by SNP analysis supports 
the prediction of an elevated G— »T,T mutation rate identi- 
fied by modelling and by the under-representation of GpT 
in human (Additional file 2: Figure S1A; -80% of expected 
value) and in other mammalian sequence datasets (data 
not shown). 

Mutational biases in mRNA sequences 

Differences in dinucleotide composition between human 
non-cytoplasmically expressed genomic DNA sequences 
and mRNA sequences were reflected in different best-fit 
mutational models between DNA and RNA sequences 
(Figure 3). The composition of mRNA sequences is 
influenced by mutational pressures operating on the 
underlying DNA sequences, as well as possible muta- 
tional biases introduced by RNA polymerase II and by 
selection pressures in the cytoplasm. While the most in- 
fluential mutation was C— »U,G, along with alternatives 
that also removed CpG dinucleotides, the second (and al- 
ternatives) all removed UpA dinucleotides, a mutational 
or selection pressure absent in mammalian DNA se- 
quences. mRNA showed further mutations that removed 
CpG and UpA dinucleotides, consistent with a greater, 
possibly cytoplasmically-driven selection pressure to re- 
move these two dinucleotides (see Discussion). Evidence 
for greater complexity of the mutation and/or selection 
pressures operating on mRNA sequences was provided by 
the greater number of mutations needed to reduce model 
error and larger minimum value from the best fitted model. 

For DNA sequences of other species, mutations re- 
moving CpG dinucleotides were found among those 
with methylated genomes (D. rerio; Figure 3C along with 
the sea squirt, C. intestinalis) as expected but were en- 
tirely absent among A. gambiae sequences (Figure 3E) 
and other organisms with non- or weakly- methylated ge- 
nomes (D. melanogaster, C. elegans; data not shown). 



Table 1 Comparison of the first three context-dependent mutations and mutation rates in three mammalian species 



Seq. 


Species 


K 


Corrected model error 


1 st 


Rate 




Rate 


3 rd 


Rate 


DNA 


H. sapiens 


3.1 


0.0135 


C^T,G 


12.06 


G— >T,T 


6.16 


A— >G,T 


1.42 




P. troglodytes 


4.6 


0.0134 


C^T,G 


11.29 


G— >T,T 


6.01 


A— >G,T 


1.30 




M. musculus 


4.6 


0.0117 


C^T,G 


14.33 


G— >T,T 


4.30 


C^G,A 


2.20 


RNA 


H. sapiens 


1.7 


0.0249 


C^U,G 


10.26 


U,A^C 


9.53 


C^A,G 


9.40 




P. troglodytes 


1.6 


0.0265 


C^U,G 


10.17 


U,A^C 


11.93 


C->A,G 


10.00 




M. musculus 


1.6 


0.0183 


C^U,G 


10.84 


U,A^C 


9.48 


C^A,G 


10.73 
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Figure 4 Frequencies of SNPs (y-axis) occurring in each 
dinucleotide context (x-axis) compiled from approximately 45 
million SNPs in human genomic DNA sequences. These were 
categorised by mutation type and by the base downstream of the 
SNP (3' dinucleotide context). Frequencies of SNPs were normalised 
to those predicted from dinucleotide frequencies and transition and 
transversion rates measured in the whole SNP dataset. 



mRNA sequences from all species showed a predomin- 
ance of mutations that removed UpA, consistent with 
widespread cytoplasmically driven selection against this 
dinucleotide. 

To compare mutational and/or selection pressures op- 
erating against CpG and UpA dinucleotides in different 
organisms, modelled mutation rates (-fold excess over 
default values) were calculated for the most influential 



mutations that remove these in each species (Figure 5). 
This analysis confirmed the absence of mutational or se- 
lection pressures against CpG dinucleotides in DNA or 
mRNA sequences in any of the organisms with non- 
methylated genomes (ecdysozoa). In contrast, selection 
against UpA dinucleotides was universal in mRNA se- 
quences of all organisms examined and occurred at an 
optimized rate that was invariably several fold higher 
than observed in corresponding DNA sequences. Muta- 
tions removing TpA was indeed absent in all mamma- 
lian datasets and in C. intestinalis among the first four 
parameters that were most influential in reducing model 
error. 

Unexpectedly, greater mutational rates in mRNA se- 
quences compared to non-cytoplasmically expressed 
DNA sequences were also observed for CpG dinucleo- 
tides, where modelled rates were consistently higher 
(despite the existence of the methylation-induced muta- 
tional pathway operating on genomic DNA sequences of 
vertebrates). The existence of an additional selection 
pressure imposed on cytoplasmically expressed sequences 
was consistent with the existent of two mutations rather 
than one (C— >U,G [1st parameter] and C— »A,G [3 rd par- 
ameter]) in human mRNA sequences and in other mam- 
malian mRNA datasets (P. troglodytes, M. musculus; 
Table 1; Figure 3B). 

Modelling mutational and selection biases in mammalian 
RNA viruses 

RNA viruses replicate in the cytoplasm of a wide range 
of eukaryotes and are potentially susceptible to the same 
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selection pressures observed in host mRNA sequences. 
To investigate this, dinucleotide compositions in complete 
genome sequences from a wide range of RNA and small 
DNA viruses infecting mammals and insects were calcu- 
lated. Consistent with previous analyses [10], most classes 
of RNA virus and small DNA viruses showed evidence of 
marked CpG suppression (Figure 6) and a wide range of 
under-and over-representation of other dinucleotides 
(Additional file 7: Figure S4A, S4B). No suppression of 
CpG was apparent among insect viruses. RNA viruses 
were subdivided into groups based on the configuration of 
their genomic RNA (based on the Baltimore classification) 
and potential exposure to the cytoplasm. RNA viruses 
with single stranded genomes (positive or negative sense) 
and reverse transcribing viruses (eg. retroviruses) showed 
similar degrees of CpG suppression that was related to 
their G+C composition, while no comparable suppression 
was observed in dsRNA viruses (Figure 6; green filled cir- 
cles p < 10" 10 ; Additional file 1: Table SI). These observa- 
tions provided tentative evidence that RNA viruses that 
expose their genomic RNA sequences to the cytoplasm 
are subject to similar selection against CpG as was evident 
in mRNA sequences. Insect viruses of any configuration 
showed no CpG under-representation. 

To investigate whether the suppression of CpG in 
RNA viruses was a response to similar mutational and 
selection pressures observed in their hosts' mRNA se- 
quences, 420 animal positive- and negative- sense 
viruses were analyzed using the 96 parameter mutational 
model (Figure 7; uncorrected RMS scores are shown in 
Additional file 8: Figure S5). As observed among mRNA 
sequences of their hosts, the main context-dependent 



mutations that reduced model error were those that 
eliminated CpG and UpA dinucleotides, prominently 
represented among both the first choice and alternative 
mutations. 

Discussion 

Modeling mutational processes 

This study investigated several unresolved issues in pre- 
vious analyses of dinucleotides and the context-sensitive 
mutational and selection biases. Specifically, are simple 
processes such as the elevated C— >T transition fre- 
quency upstream of G residues arising from methylation 
necessary and sufficient to account for the spectrum of 
skewed dinucleotide frequencies observed in mammalian 
genomic DNA sequences? A previous investigations of 
dinucleotide composition of genomic sequences in a 
range of eukaryotic phyla showing different degrees of 
methylation and CpG under-representation demonstrated 
(despite previous reports to the contrary; eg. [18]) that the 
observed CpA/TpG over-representation arose in direct 
proportion to the loss of CpG dinucleotides [13]. Using a 
modeling method on which the current study was based, 
Duret and Galtier [14] further showed that assigning an 
elevated C— >T,G rate upstream of G residues reproduced 
the G+C relationship with CpG under-representation 
in human genomic DNA and, rather counter-intuitively, 
additionally reproduced the G+C-dependent depletion 
of TpA dinucleotides also observed in human DNA se- 
quences [14]. Despite the title of that study however, the 
actual depletion of UpA is proportionately greater in 
genomic DNA than could be modelled and the further ef- 
fect of this primary mutational bias on other dinucleotide 
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representations was not analyzed. A further problem with 
this hypothesis is that UpA deficiencies are equally pro- 
nounced among organisms that lack methylation of gen- 
omic DNA and show no suppression of CpG frequencies 
(eg. A. gambiae - Additional file 2: Figure S1E). 

In the current study, we have substantially expanded 
the modelling process to allow multiple mutational 
biases and rates and used model error calculations to 
allow each to be systematically optimized rather than 
empirically assigned. The method proved robust, with 
minimal variability in predicted mutations and muta- 
tional rates of human DNA when different random se- 
lected samples were analyzed (Additional file 6: Table 
S3) or on comparison of mammalian genomic DNA and 
RNA datasets where selection pressures are expected to 
be similar between species (Table 1; Figure 5). We do ac- 
knowledge, however, that finding the simplest combin- 
ation of context-dependent mutations and associated 
mutational rates that fits the observational data is not 
necessarily the actual underlying biological process. 
However the mutational models we have discovered are 
compelling in their simplicity, efficiently account for ob- 
servational data with a minimum of parameters and pre- 
dict context-dependent mutations and rates that are 
both biologically plausible and consistent with results 
and inferences made from different approaches [19-22]. 
This applies particularly to mammalian DNA datasets 
where the use of just two mutations reduced baseline 
corrected model error to close to zero. Furthermore, the 



optimized mutational rate for the C— »T,G transition was 
comparable to estimates based on different methods. For 
example a 12-fold higher rate compared to other transi- 
tions was reported using a simple equilibrium model 
[22]. More recent maximum likelihood approaches that 
incorporate the C— >T,G transition rate in human genomic 
DNA as a separate parameter to standard substitution 
models for likelihood-optimization, arrive at mutational 
rates ranging from 8.5 (TF model; [19]) to 9.2 [21], similar 
to the modelled ll.Ox - 12.1x rates we derived for mam- 
malian DNA (Table 1). 

One unanticipated finding that supports the validity of 
the modelling method was the reconstruction of the re- 
lationship between the under-representation of CpG and 
TpA (and other dinucleotides) with G+C content. Quad- 
ratic lines of best fit through observational data super- 
imposed almost exactly on model predictions using 4 or 
fewer parameters Figure 2 and Additional file 2: Figure 
SI). This provides a simpler explanation than hypotheses 
that propose different susceptibilities of high and low 
G+C content DNA to methylation and deamination or 
different selection pressures operating on CpG islands 
that contain higher proportion of coding sequences [23]. 
For example, one widely discussed model argues that gen- 
omic DNA with a low G+C content is more susceptible to 
methylation-induced mutations that eliminate CpG dinu- 
cleotides [8,9]. The effect of replacing C with T further re- 
duces G+C composition in these regions encouraging 
further methylation and elimination of CpG dinucleotides. 
The theory provides a compelling explanation for the ex- 
istence of alternating regions of low C+G content and 
heavily methylated DNA interspersed with CpG -rich 
islands (particularly in warm-blooded animals where ele- 
vated temperatures potentially contributes to the accessi- 
bility of low G+C DNA to methylation). However, we 
have found that precisely the same relationship emerges 
from a model in which G+C content had no influence on 
methylation rate. 

Compared to this simple, single parameter modelling 
previously reported of human DNA [14], at least one 
further mutation and better optimized C— >T,G mutation 
rate and k (transition / transversion) ratio was required 
to reproduce the steeper positive (CpG) and negative 
(TpA) gradients between dinucleotide representation 
and G+C content. In the case of TpA, the use of two pa- 
rameters additionally reproduced the degree of under- 
representation of TpA observed in genomic sequences 
that was not effectively modelled in the original study. 

Previous investigation and modelling of mutations that 
create dinucleotide frequency biases have typically concen- 
trated specifically on CpG and its under-representation in 
mammalian genomes. There is therefore a dearth of pub- 
lished information to corroborate predictions for other di- 
nucleotides and among other organisms without genomic 
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methylation. There is for example little information on the 
potential existence of the highly influential G— »T,T muta- 
tion identified in mammalian DNA sequences, C— >T,T in 
D. rerio and G— >A,A among ecdysozoa. The GpT di- 
nucleotide is depleted in mammalian DNA sequences as 
well as in eubacterial and mitochondrial genomes [24], and 
consistent with the greater than expected frequency of 
SNPs involving this dinucleotide in a large scale analysis of 
human SNP data (mean 1.7-fold; Figure 4). This mutational 
bias is indeed visible although uncommented on in previ- 
ous SNP analyses of human and mouse sequences [25,26]. 
Together these findings are consistent with a greater mut- 
ability of this dinucleotide. 

Differential selection on expressed mRNA sequences 

In contrast to previous studies, investigation of muta- 
tional and/or selection biases was based on genomic se- 
quences separated into expressed directly as mRNA 
sequences and DNA sequences that are non-transcribed. 
This differentiation was particularly relevant for organisms 
with high proportions of coding and other expressed 
sequences in their genomes. This differentiation re- 
vealed several differences both in their dinucleotide 
frequency biases and in the optimised models for their 
underlying mutational and selection biases. 

The first observation was that dinucleotide frequency 
biases were often distinct between genomic DNA and 
mRNA sequences, even though the latter sequences ne- 
cessarily incorporate mutational processes operating on 
genomic DNA. The proportionately greater under- 
representation of Up A dinucleotides for a given G+C 
content observed in mRNA sequences has been previ- 
ously described [7], although this phenomenon extends 
to several other dinucleotides which show even greater 
compositional differences (such as GpA and CpA in hu- 
man mRNA; Additional file 2: Figure SIB). Further evi- 
dence that different selection may be operating on 
mRNA sequences was indicated by frequent asymmet- 
ries in complementary dinucleotides, such as CpA and 
UpG that could not have originated through mutational 
biases occurring on genomic DNA (where they are ef- 
fectively symmetrical). Mutational models developed for 
mRNA sequences showed several further differences 
from those optimised for genomic DNA sequences of 
the same organism. Most prominent was the evidence in 
all species examined for strong selection against the 
UpA dinucleotide, ranked first or second in order of in- 
fluence. In contrast, selection against UpT was either ab- 
sent (mammalian species, C. intestinalis) or substantially 
weaker in genomic DNA sequences (Table 1, Figure 5). 
Best fitting mutations that removed UpA residues were 
usually transitions {eg. U— >C,A) but showed no evidence 
of context dependence that would be expected for a mu- 
tational bias. Similarly, among the species investigated, 



the 96+1 parameter (asymmetric) model generated simi- 
lar numbers of upstream and downstream-base condi- 
tioned mutations, such as U,A— >C in mammalian 
mRNA and D. rerio and U,A— >G in A. gambiae (Table 1, 
Figure 3). This contrasted with the strict dependence of 
methylation-induced transitions on a downstream G 
residue in DNA sequences. 

Selection against UpA dinucleotides in cytoplasmically- 
expressed sequences might be expected given the role of 
the UpA dinucleotides as a recognition motif for RNAseL 
and other RNA degrading enzymes [6,7,27]. For example, 
human mRNA sequences expressed in the cytoplasm 
of CHO (hamster) cells showed greater degradation 
rates in proportion to frequencies of UpA residues in 
the cytoplasm [6]. Although there is little information 
on degradation pathways of mRNA sequences in in- 
vertebrates, the observation that UpA is consistently 
under-represented throughout eukaryotic phyla pro- 
vides some evidence for the existence of comparable 
regulatory mechanisms [7]. As suggested many years ago, 
the suppression of UpA dinucleotides among RNA viruses 
infecting mammalian, plant and insect cells (Figure 6; 
[10,11]) may therefore represent their specific adaptation 
to evade RNA degradation during their replication cycle. 
In the current study, further evidence for specific selection 
against UpA dinucleotides was provided by the mutational 
model for positive- and negative-strand animal viruses in 
which mutations removing UpA residues were ranked sec- 
ond behind those removing CpG (Figure 7). 

Selection against CpG dinucleotides in expressed 
RNA sequences 

Mutations eliminating CpG residues were also highly in- 
fluential in reducing model error for mRNA sequences 
and ranked 1 st or 2 nd in species with methylated ge- 
nomes (Figure 3). Although these arise (at least in part) 
from mutational biases in the underlying genomic se- 
quence, modelled mutational rates were invariably 
higher in mRNA sequences (Figure 5). Furthermore, in 
C. intestinalis, CpG depletion was best modelled by mu- 
tations that were dependent on the upstream base {eg. 
C,G— >A; Figure 3 and data not shown). These observa- 
tions provide evidence that additional, likely selective 
rather than mutational pressures against CpG dinucleo- 
tides are exerted on RNA sequences expressed in the 
cytoplasm. The existence of this selection pressure oper- 
ating independently of DNA methylation induced muta- 
tion is supported by our finding of mutational biases 
against CpG dinucleotides among RNA viruses (Figure 7) 
in which conventional deamination and mutation as a 
consequence of methylation cannot occur. This selection 
process may underlie the prominent under-representation 
of CpG dinucleotides in many classes of RNA virus 
[10,11] infecting mammals and plants to extents 
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comparable to those observed in their hosts' mRNA se- 
quences (Figure 2, Additional file 7: Figure S4A, S4B; 
[28]). Prominent exceptions to CpG under-representation 
are viruses with dsRNA genomes (Figure 6) and many of 
the helical-classed plant viruses (data not shown). In these, 
however, RNA genomic sequences remain packaged 
within virions throughout their replication cycle and they 
therefore may not be subject to the same selection pres- 
sures operating on exposed RNA. 

It could be argued that host cell defences against viral 
infections that mutate their RNA genomes may account 
for the various under- and over-representations of spe- 
cific dinucleotides. Of these, members of the APOBEC 
family deaminate cytosines in single-stranded DNA and 
RNA potentially in specific sequence contexts [29] 
although those identified (C,C->U and U,C->U) would 
not create the dinucleotide biases in RNA viruses and of 
course the action of APOBEC is specific to retroviral ge- 
nomes, not the RNA viruses modelled in the current 
study. A different RNA editing enzyme that is interferon- 
induced and known to be active against RNA viruses is ad- 
enosine deaminase acting on RNA 1 (ADAR1). However, 
its mutagenic effect is not known to be dependent on di- 
nucleotide context [30] and therefore similarly cannot 
create the frequency biases observed. 

Although the nature of the selection against CpG di- 
nucleotides remains poorly understood and has not been 
investigated functionally, there are a number of tantaliz- 
ing clues towards the existence of mechanisms coupled 
to innate immunity that recognize RNA with CpG 
motifs [28,31,32]. There may be, for example, RNA- 
degrading enzymes that recognise CpG motifs, analogous 
to UpA targeting by RNAseL and other RNA degrading 
enzymes that influence mRNA half-lives in the cytoplasm 
(see above). Alternatively, CpG dinucleotides in viral 
RNA may be selected against as they may serve as 
targets for currently uncharacterized pathogen recog- 
nition receptors couple to interferon or other cell de- 
fence pathways [12]. The induction of interferon- 13 in 
macrophages exposed to synthetic RNA oligonucleotides 
containing CpG residues [33] may be an example of this 
process, functionally and perhaps evolutionarily related to 
Toll-like receptor 9 that recognizes non-methylated CpG 
dinucleotides in DNA sequences. 

Further evidence that the presence of CpG dinucleo- 
tides in viral sequences either activate or are targets of 
cell defence mechanisms is provided by the observation 
that polioviruses with artificially elevated CpG frequen- 
cies in their genomic RNA were markedly attenuated 
and replicated to titres several orders of magnitude 
lower than wild type virus in in vitro cell culture [34-36] . 
Intriguingly, cellular genes coding for proteins induced 
as part of the innate response to infection, such as type 
1 interferons, show substantially greater depletion of 



CpG dinucleotides than other genes of similar G+C 
composition [31], suggesting that this adaptation is re- 
quired for effective gene expression in a hostile cytoplas- 
mic environment. Mammals (and potentially other 
vertebrates) and plants with their methylated genomes 
and associated depletion of CpG may therefore have in- 
dependently co-opted this dinucleotide as a marker of 
self/non-self recognition. This potentially explains the 
selection against CpG in viruses infecting members of 
these eukaryotic phyla [28] . The existence of such recog- 
nition systems may in turn have placed additional selec- 
tion pressures on host expressed mRNA sequences to 
evade these viral countermeasures. 

Conclusions 

The findings in the current study provide the first com- 
prehensive analysis of context-dependent mutational 
biases and selection pressures in organisms with both 
methylated and non-methylated genomes. The finding of 
pressures operating on genomic DNA in addition to the 
previously described C— »T,G mutation in mammals, a 
set of quite different biases in non-methylated genomes 
and additional selection pressure operating on sequences 
expressed as mRNAs in all organisms provides a series 
of predictions that can be directly analyzed in biological 
studies. The evidence obtained for selection pressures 
against UpA and CpG dinucleotides in mRNA sequences 
of methylated organisms provides a coherent explan- 
ation of their under-representation in cytoplasmically 
replicating RNA viruses which has eluded previous ana- 
lyses [10,11]. It provides exciting new insights into the 
process of self / non-self recognition that underlies host 
innate immunity to viral pathogens. 

Methods 

Sequences and dinucleotide frequency calculation 

DNA sequences from human (Homo sapiens), other mam- 
mals (chimpanzee [P. troglodytes], mouse [M. musculus]), 
another vertebrate (zebra fish [D. rerio]) and other animals 
(sea squirt [C. intestinalis], fruit fly [D. melanogaster], 
mosquito [A gambiae] and nematode [C. elegans]) were 
the subject of the investigation. Genome sequences were 
obtained from UCSC for the following genome versions: 
H. sapiens - hgl9; P. troglodytes - panTro3; M. musculus - 
mm9; D. rerio - danRer7; C. intestinalis - ci2; C. elegans - 
celO; A gambiae - anoGaml; D. melanogaster, dm3. 

Exon coordinates were extracted from UCSC using the 
table browser function using the following tables: hgl9: 
knownGenes; panTro3: refGene; mm9: knownGenes; 
danRer7: refGene; ci2: refGene; celO: refGene; anoGaml: 
refGene; dm3: refGene. Sequences corresponding to 
exon coordinates were removed and the remaining non- 
cytoplasmically expressed DNA genomic sequences were 
divided into 50,000 bp lengths for analysis. 
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From each species, non-redundant mRNA sequences 
were downloaded from the http://www.ncbi.nlm.nih.gov/ 
gene database, with sequences shorter than 2500 bases 
excluded. Complete genome sequences from available 
positive and negative stranded RNA viruses infecting 
mammals were obtained from GenBank (Additional file 
9: Table S4). The analysis used non-redundant sequences 
curated in the RefSeq project comprising prototype or 
reference sequences from each virus family, and species. 

SNPs in human DNA and their immediate 5' and 3' 
bases were obtained from the NCBI dbSNP database 
(ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/) 
on 18/01/12. The bases immediately adjacent to the 
44,415,612 SNPs were extracted by parsing FASTA 
files, ignoring insertion/deletion polymorphisms. 

Mono- and dinucleotide frequencies and ratios of ob- 
served dinucleotide frequencies to those expected from 
mononucleotide composition (G+C content in the case 
of DNA sequences) were calculated using the program 
Composition Scan in the SSE package [37]. 

Modelling substitution rates in different dinucleotide 
contexts 

We developed a systematic model to determine optimal 
mutation rates in each dinucleotide context that best 
correlate with DNA and RNA composition of eukaryotic 
and viral sequences. These rates can viewed as variations 
from a default rate transformation matrix, Q: 
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where Qyw f° r distinct Y and W is the default rate of 
transformation from nucleotide Y to nucleotide W. k is 
the transition to transversion ratio and, for the default 
transformation rates given by the matrix Q, 0 is the 
equilibrium proportion of G+C mononucleotides. It is 
assumed that these rates can be influenced independ- 
ently by the two neighbouring nucleotides to Y as fol- 
lows. For a given trinucleotide XYZ the mutation rate 
from Yto a different nucleotide Wis given by: 

r(X, Y^W.Z) =f(X, Y^W) Q YW f{Y^W,Z) 

where f(X, Y—>W) is the factor giving the change of the 
mutation rate of Y—>W from its default value when the 
upstream nucleotide is X, and f(Y—>W, Z) is the factor 
when the downstream nucleotide is Z, and both factors 
contribute independently. For example iff(X, Y—>W) and 
J(Y—>W, Z) were changed from their default values of 1 
to values 0.7 and 2, then the mutation rate r(X, Y-^>W, 
Z) would increase by a factor of 1.4 from its default 



value of Qyw. This model generalizes Duret and Galtier s 
model [14], in which k = 2.1, j{C-^T, G) = 27.6/2.1 and 
the rest of factors all equal to 1.0. 

For each of the 4 nucleotides, Y, there are 3 possible 
transitions to a different nucleotide, W giving 12 possible 
transitions Y—>W. Since there are 4 possible upstream 
nucleotides X, there are 48 factors of the form f(X,Y—>W) 
and since there are 4 downstream nucleotides Z there 
are 48 factors of the form f(Y—>W, Z), giving a total of 
96 factors in the model. (Note that this is half the num- 
ber of factors that would be needed in a model that had 
a factor for each of the 12 possible transitions and each 
of the 16 combination of upstream and downstream nu- 
cleotides.) In RNA all the 96 factors in the model are 
independent. However in DNA there is strand symmetry 
which leads to equal rates of mutation in complemen- 
tary DNA strands. Consequently if X\ Y' and W are the 
complementary nucleotides to X, Y and W respectively, 
then f(X, Y^W) = f(Y'^>W, X'). Hence for DNA there 
are only 48 independent factors. 

For any specified set of mutational rates r(X, Y-^>W, Z) 
we can simulate the mutational process starting from 
some arbitrary compositions until an equilibrium is 
reached. The method used is as follows. 

Let dij(u) be the proportion at time u of all the dinu- 
cleotides that is dinucleotide ij, and let m/u) be the pro- 
portion of all the nucleotides that is nucleotide In our 
model the m/u) and dij(u) are related by: 

mj(u) = Yudij(u) = Yjdji(u) (1) 

i i 

The first sum is over all the dinucleotides ij where ; is 
the downstream nucleotide, and the second sum is over 
all dinucleotides ji where / is the upstream nucleotide. 
The reason that these are the same is that in our model 
we assume an arbitrary long RNA or DNA sequence so 
every nucleotide occurs once in a dinucleotide as its up- 
stream nucleotide and once in a dinucleotide as its 
downstream nucleotide. Each time there is a transition 
in our model the change affects equally the nucleotide 
where it is the upstream nucleotide and the dinucleotide 
where it is the downstream nucleotide. Hence provided 
the two sums are the same at the start of the simulation 
they will remain the same throughout. 

Let tij k (u) be the proportion at time u of all trinucleo- 
tides that is ijk. The trinucleotide ijk consists of an up- 
stream dinucleotide ij and a downstream dinucleotide jk 
sharing a common middle nucleotide Following previ- 
ous approaches [14] we assume that in trinucleotides the 
up and downstream dinucleotides that share a common 
middle nucleotide are independent. From this it follows 
that tijk(u) = dij(u) Pjk\j(u), where Pjky(u) is the proportion 
at time u of dinucleotide jk among all the dinucleotides 
whose left nucleotide is Since P ;k y(u)= dj k (u)/ m ; (u), it 
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follows that: 



tijk(u) 



dij(u)djk(u) 
mj(u) 



(2) 



The rate of change in the proportion of dinucleotide ij 
is given by the equation: 

d ^P-{u) = £ tij k (u)Yur(i,j^>l,k)b((x,y), (/,/-►/,£)) 
du iik i 

(3) 

where b((x, y), (i, j—>l, k))) is the change in the number 
of dinucleotides xy when a trinucleotide ijk turns to a 
trinucleotide ilk, i.e., the number of dinucleotides xy in 
trinucleotide ilk minus the number in trinucleotide ijk. 
For example, b((C, A), (C, A— >T, G)) = -1 since one 
CpA is lost by changing from CAG to CTG. Also b{{T, 
T), (T, C^T, T)) = 2, and 6((A, A), (G, G^T, C)) = 0. It 
is not difficult to recognize that the value b({x,y), 
m, k))) can take is - 2, - 1, 0, 1 or 2. 

Let F denote the vector consisting of k and all the fac- 
tors / For any given values of 6 and F } the steady state 
dinucleotide proportions can be found by substituting 
(1) and (2) into (3) and integrating the resulting 16 
nonlinear equations from an arbitrary starting compos- 
ition until the proportions stabilize. Let d XY be the limit- 
ing proportion of dinucleotide XY and let m x be the 
limiting proportion of nucleotide X The limiting C+G 
proportion, o) equals m c + m G and for each dinucleotide 
XY we can calculate the models prediction of the ob- 
served to expected dinucleotide ratio, XpYo/e, from 
d XY /(m x m Y X (If there was no correlation between the 
nucleotides in dinucleotides then this ratio would be 1.) 
By tabulating C+G and the resulting XpYo/e for a range 
of values 6 and interpolating we can find for any value o 
of C+G, the models estimate of XpYo/e. We denote this 
function by M XY (o),F). 

To assess how good a fit our model is to a set of sam- 
ples, the root mean square error, RMS, between the 
model and the data was calculated. Assume there are N 
samples and sample n has a C+G proportion of o) n and 
the o/e ratio of the XY dinucleotide is R n>xy . Then for a 
vector of parameters F the RMS is: 



RMS(F) 



\ 



^^(Rn,xy-M xy (G) n ,F)f 

n=l x,y 



16N 



The goal is to find the minimum value of RMS(F) and 
the corresponding value of the parameters F, which gives 
the best fit to the data. However we are interested in so- 
lutions in which only a small number of the factors, f, 
deviate from their default value of 1. The calculation is 



done in a series of stages. First we find the minimum 
values of RMS(F) for k and each single factor in turn, 
and choose the factor that gives the best fit. Then we 
repeat the calculation allowing the values of k and 
the previously selected factor and each other factor in 
turn to vary from their default values, and find which 
other factor allows the best reduction in RMS value. 
Then this is repeated to select at each step the best add- 
itional factor to add to the previously selected ones 
allowing at every step the re-optimisation of the values of 
k and /for each of the previously selected nucleotides. 

The steps are shown below. Here M is the total num- 
ber of parameters to vary {i.e. either 49 or 97) and 
Max_K is the maximum number of parameters we want 
to allow to deviate from their default values. (In the 
DNA case the remaining parameters are set equal to 
their complementary parameter.) We number the pa- 
rameters f 0 = k, fj = f{A, A^C ), f 2 = f{A, A^G ), ... 
Variable R° pt denotes the minimum value of RMS found 
when only the parameters in D are allowed to vary, and 
F° pt the corresponding vector of parameter values. 



Initialise F = (f 0 ,fi,f 2 f M ) := (2 . 1, 1.0, 1.0,..., 1.0) 
S := {0}; P :={0,1,...,M}; Best_R 0pt = qo 
for j := 1 ... Max_K 

for i in p\s (i.e. set pwith elements of set s removed) 

i> : = sui 

R 0pt , F 0pt := min RMS(F) varying all parameters fj for i g t> 
if R 0pt < Best_R 0pt 

Best_R 0pt := R 0pt ; Best_s : = t>; Best_F := F 0pt 
end if 
end for 

Output Best_R, Best.s, Best_F 0pt as solution for k 
variable parameters 
S := Best_s ; F : = Best_F 0pt 

end for 



Although the mutation rate/(W— >X, Y) is a separate par- 
ameter in the model from f(X^>W, Y), their effects are re- 
lated: setting J(W—>X, Y) equal to a value v usually has a 
similar effect on the equilibrium compositions to setting / 
(X—>W, Y) equal to the value 1/v. Consequently when pre- 
senting the results only the factor greater than 1 is reported. 
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Strand symmetry in DNA sequences 

Models used and the information that can be obtained 
from modelling mutational biases and selection pressure 
depends on the nature of the nucleic acid. Studies to 
date have been performed on genomic DNA; without 
evident polarity in its replication in eukaryotes, the ac- 
tual number of independent dinucleotides amounts to 
only 10. These are ApT, TpA, CpG and GpC (self-com- 
plementary dinucleotides) and the following pairs which 
are present in equal frequencies in a large enough se- 
quence sample; ApA and UpU, GpG/CpC, CpA/UpG, 
ApC/GpU, GpA/UpC and ApG/CpU. As described 
above in the model description, this symmetry leads to 
mutational biases dependent on a downstream base being 
indistinguishable from a complementary bias dependent 
on an upstream base. Thus, the well characterised 
methylation-induced mutation, represented here as C— >T, 
G is formally equivalent a complementary process on the 
opposite DNA strand, i.e. C,G— >A. In the current study, 
DNA mutations are by convention generally presented in 
the former format. 

Modeling dinucleotide biases in single stranded 
(RNA) sequences 

For RNA sequences, different considerations apply. Mu- 
tational biases originating from context-dependent mu- 
tational biases will typically be symmetrical if originating 
from biases in the underlying DNA sequence from which 
it was transcribed, or in an RNA virus sequences where 
the same RNA polymerase transcribes sense and antisense 
genomic sequences. On the other hand, mutational biases 
from RNA polymerase II that transcribes mRNA sequences 
and dinucleotide composition abnormalities originating 
from selection in the cytoplasm lead to asymmetries that 
need to be separately modelled. As described above, model- 
ling of mutational / selection biases in RNA therefore con- 
siders each dinucleotide separately {e.g. the frequency of 
UpC does not necessarily equal the frequency of GpA and 
as described in the previous section, mutations occurring in 
both upstream and downstream dinucleotide contexts have 
to be modelled separately, creating a total of 96 instead of 
48 model parameters. 

Availability of supporting data 

(Additional file 2: Figure SI, Additional file 3: Figure S2 
and Additional file 5: Figure S3) and (Additional file 1: 
Table SI, Additional file 4: Table S2, Additional file 6: Table 
S3 and Additional file 9: Table S4) are available from: 

Additional files 



Additional file 1: Table SI. Significance testing of differences in cpg 
and upa frequencies. 



Additional file 2: Figure SI. Observed / expected frequencies of all 16 
dinucleotides in human DNA (Additional file 2: Figure S1A) and mRNA 
sequences (Additional file 2: Figure S1B), D. rerio DNA and mRNA 
sequences (S1C, S1D) and A. gambiae DNA and mRNA sequences (S1E, 
S1F). Values (y-axis) were plotted as a function of G+C content (x-axis). 
Frequencies of each dinucleotide predicted from mutational models with 
1, 2 and 4 parameters (1p, 2p and 4p; see inset key) are superimposed on 
each distribution along with the quadratic line of best fit for each dataset 
generated from starting sequences ranging in G+C composition from 
20%-80%. 

Additional file 3: Figure S2. Relationship between fragment length 
and modelled RMS scores of human DNA fragments of different lengths 
using 4 parameters. (A) Fragment lengths depicted in a linear scale. (B) 
To estimate RMS distances for sequences without sampling error {je. for 
sequences of infinite length), sequence lengths were transformed using 
the empirically derived transformation 1/length 042 to generate a linear 
relationship with RMS distances. The intercept with the y-axis line 
represents the RMS score for sequences of infinite length (0.0095). This 
represents the model error for this dataset. 

Additional file 4: Table S2. Measured and predicted minimum rms 
scores for dna and mrna datasets from different organisms. 

Additional file 5: Figure S3. Uncorrected model errors (y-axis) using 
mutational models with between 1 to 4 context-dependent mutational 
biases (labelled under graph line) formatted as in Figure 3. 

Additional file 6: Table S3. Reproducibility of corrected model error 
and mutational rates on data re-sampling. 

Additional file 7: Figure S4. Observed / expected frequencies of all 16 
dinucleotides of mammalian viral RNA sequences (see legend to 
Additional file 2: Figure S1). 

Additional file 8: Figure S5. Uncorrected model errors (y-axis) for 
mammalian RNA viruses using mutational models with between 1 to 4 
context-dependent mutational biases formatted as in Figure 3. 

Additional file 9: Table S4. Listing of mammalian viral sequences 
analysed in study. 
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